First telomere-to-telomere gapless assembly of the rice blast fungus Pyricularia oryzae

Rice blast caused by Pyricularia oryzae (syn., Magnaporthe oryzae) was one of the most destructive diseases of rice throughout the world. Genome assembly was fundamental to genetic variation identification and critically impacted the understanding of its ability to overcome host resistance. Here, we report a gapless genome assembly of rice blast fungus P. oryzae strain P131 using PacBio, Illumina and high throughput chromatin conformation capture (Hi-C) sequencing data. This assembly contained seven complete chromosomes (43,237,743 bp) and a circular mitochondrial genome (34,866 bp). Approximately 14.31% of this assembly carried repeat sequences, significantly greater than its previous assembled version. This assembly had a 99.9% complement in BUSCO evaluation. A total of 14,982 genes protein-coding genes were predicted. In summary, we assembled the first telomere-to-telomere gapless genome of P. oryzae, which would be a valuable genome resource for future research on the genome evolution and host adaptation.


Background & Summary
Pyricularia oryzae (syn., Magnaporthe oryzae), an ascomycete fungal pathogen, causes rice blast, one of the most destructive diseases of rice throughout the world 1,2 .The pathogen is an important and long-established model species for understanding fungal-plant interactions 3,4 .Previously, we sequenced and assembled the first genomes of field strains (P131 and Y34) and performed a comparative analysis between the laboratory and field strains, which demonstrated that translocation of transposable elements (TEs), gain or loss of isolate-specific genes and gene family expansion are essential factors, delimiting genomic plasticity and adaptability of P. oryzae 5 .Although these assemblies had facilitated the understanding of the genome characteristics of P. oryzae, the genome of the two strains were highly fragmented to more than one thousand scaffolds, for Sanger (2-fold) and 454 (18-fold) sequencing technologies were used in the previous study.Recently, over 50 genomes of different strains of P. oryzae have been available in public genome databases.These genomes were sequenced on the next-generation sequencing platforms, such as second-generation sequencing platforms (e.g., Illumina sequencers) and/or third-generation sequencing platforms [e.g., Pacific Biosciences (PacBio)], which facilitated the genetic studies of genomic changes and pathogenicity variation within P. oryzae [6][7][8] .However, currently most of these assemblies are fragmented and contain a large number of unplaced contigs and/or gaps owing to the presence of repetitive DNA elements in the P. oryzae genomes, which prevented the dissection of molecular mechanisms of adaptive evolution.Since the importance of genome assembly completeness in genomic analysis, we re-assemble the genome of P. oryzae stain P131 by combining Illumina, PacBio sequencing and high throughput chromatin conformation capture (Hi-C) mapping, which was the first telomere-to-telomere gapless assembly of the P. oryzae genome.
A total of 10.03 Gb PacBio long-read sequencing data (~250x genome coverage) and 4.44 Gb Illumina short-read sequencing data were generated (Table 1).Hi-C library was prepared, sequenced and generated 5.57 Gb sequencing data (~140x genome coverage).The long reads were de novo assembled and corrected.The short reads were used to polish the assembly.Redundant genomic contigs or mitochondrial contigs were then removed.The Hi-C sequencing data were used to anchor and refined remained contigs.The mitochondrial genome was assembled independently by Mitochondrial Long-read Iterative Assembly (MLIA) pipeline 9 .The final polishing of the complete genome was performed.Finally, seven gapless chromosomes (43,237,743 bp with a contig N50 of 7.05 Mb; Fig. 1a) and a circular mitochondrial genome (34,866   the final assembly (Fig. 2).The new assembly represented a significant improvement over the previous version GCA_000292605.1 5,10 (1,823 assembled contigs and contig N50 = 12.3 kb; see Table 2 and Fig. 3).
The nuclear genome was annotated by Braker2 pipeline 11 .The mitochondrial genome was annotated by MFannot 12 using genetic code 4. In conclusion, the nuclear genome is predicted to contain 14,968 genes (including 20,797 transcripts), and the mitochondrial genome is likely to carry 14 conserved protein-coding genes (Table 3).A total of 99.9% of the BUSCOs were mapped onto the P131 genome assembly.Approximately 14.31% of the genome carried repeat sequences, most of which were TEs, which was significantly greater than the previous version (Table 4).
The telomere repeat sequence (TRS) (TTAGGG) n was presented on both ends of chromosomes 2, 4, 5, 6, and 7 and one end of chromosomes 1 and 3 in our assembly.We then compared the TRS in the published near-complete assembled genome of P. oryzae strains with the genome assembly generated in this study.Interestingly, minority deficiency and telomere variability of TRSs in P. oryzae were extensively observed, which may play subtle roles in pathogenic adaptation [13][14][15] .In summary, we assembled the first telomere-to-telomere gapless genome of P. oryzae, which can be instrumental in understanding the genome evolution and host adaptation in the rice blast fungus.

Methods
Sampling and DNA extraction.The P. oryzae strain P131 was grown and maintained on oatmeal tomato agar (OTA) plates 16 .Conidia were produced on OTA plates and harvested from 7-day culture plates grown at 25 °C under constant fluorescent light.Hyphae were collected from 2-day-old cultures in complete medium shaken at 150 rpm at 25 °C.Genomic DNA extracted from vegetative mycelia using cetyltrimethylammonium bromide (CTAB) protocol was used for genome sequencing 17 .template libraries (with fragment size of >10 kb selected).Library quality was detected by Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Waltham, MA, USA, Q33230).The average fragment size was estimated on Bioanalyzer 2100 (Agilent, Santa Clara, CA).SMRT sequencing was performed on the Pacific Biosciences RSII sequencer (PacBio, Menlo Park, CA) according to standard protocols using the P4-C2 chemistry.A total Table 2. Summary of the genome assembly.a using "fungi_odb10" lineage with 758 BUSCO markers; b using "ascomycota_odb10" lineage with 1706 BUSCO markers.The numbers in parentheses were "complete and fragmented orthologues" and "missing orthologues", respectively.

Illumina
Fig. 3 The alignment of scaffolds from the previous assembly to the new assembled chromosomes.The X and Y axis represented the new chromosomes and previous scaffolds, respectively.  of 10.03 Gb PacBio sequencing data with a subread N50 of 14.5 kb.In addition, Illumina HiSeq X Ten sequencer using paired-end technology was also used to perform genome sequencing and 4.44 Gb sequencing data (150 bp paired-end reads) were yielded at CapitalBio Technology Co., Ltd (Beijing, China).Hi-C library was prepared from cross-linked chromatins of fungal mycelia by Novogene Co., Ltd (Beijing, China).In brief, the tissue was ground and then cross-linked with 4% formaldehyde solution.After the sample of crosslinking reaction and cell lysis, nuclei were digested with 4-cutter restriction enzyme DpnII.Subsequently, ligated DNA was purified and fragmented into 300 bp size on average.The constructed Hi-C library was sequenced by Illumina NovaSeq 6000.5.57 Gb paired-end sequencing data (150-bp length) were generated.The Hi-C maps from raw data were performed by Juicer (v1.6) 18 , followed by using a manually correction with Juicebox (v2.13.07) 19 .

Data Recodes
The raw genomic sequencing data used and/or analyzed during the current study are available at NCBI Sequence Read Archive database (Accession number SRR24890910 36 , SRR24890911 37 and SRR24890912 38 ).The assembled genome was deposited under the same BioProject with P. oryzae strain P131 at NCBI (Accession number: GCA_000292605.2 39 ; BioProject ID: PRJNA82693; BioSample ID: SAMN31867770).The accession numbers from Chr1 to Chr7 chromosome sequences were CP114135 to CP114141, respectively.And the accession number corresponding to the mitochondrial genome sequence was CP114142.

Technical Validation
DNA sample quality.The DNA quality was detected using Qubit (Thermo Fisher Scientific, Waltham, MA) and Nanodrop (Thermo Fisher Scientific, Waltham, MA).Sequencing data assessment.The short read data were assessed by fastp v0.23 40 .The genomic short sequencing reads had 49.75% GC content.The Q20 and Q30 percentages were 97.1% and 92.06%, respectively.The Hi-C sequencing data had 50.5% GC content, and had quality scores of 97.67% (Q20) and 93.64% (Q30), respectively.
Evaluation of the genome assembly.The genome assembly quality was evaluated through the Benchmarking Universal Single-Copy Orthologs (BUSCO) tool with the "fungi_odb10" lineage as a reference dataset.The results showed that 99.9% of all 758 BUSCO markers were assembled, implying a high level of completeness of the assembly.In addition, the results generated from "ascomycota_odb10" lineage showed 99.4% of all 1706 BUSCO markers were include (Table 2).

Fig. 1
Fig. 1 The P. oryzae strain P131 genome assembly.(a) Nuclear genome: Track 1 illustrates the seven assembled nuclear chromosomes with the indicated sizes.The red arrows at the end of chromosomes indicate telomeric repeat sequences (TTAGGG) n .Tracks 2 through 4 show transposon distribution, gene density, and GC density on the seven chromosomes, respectively.(b) Mitochondrial genome: Track 5 depicts a circular illustration of the mitochondrial genome carrying genes (represented by different color blocks), including genes encoding 14 standard fungal core PCGs (nad1-nad6, nad4L, cob, cox1-cox3, atp6, atp8, and rps3), ribosomal subunits (rns and rnl) and 27 tRNA genes.Tracks 6 and 7 display the introns present in the above genes and GC density of the mitochondrial genome, respectively.

Fig. 2
Fig. 2 An illustration of Hi-C genome-wide interaction map result.

Table 1 .
bp; Fig.1b) were constructed in Summary of sequencing raw data of P. oryzae strain P131.

Table 3 .
Detailed summary of assembled chromosomes.

Table 4 .
Classification of repeat sequences.